{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Simple RAG (Retrieval-Augmented Generation) System for CSV Files\n",
        "\n",
        "## Overview\n",
        "\n",
        "This code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying CSV documents. The system encodes the document content into a vector store, which can then be queried to retrieve relevant information.\n",
        "\n",
        "# CSV File Structure and Use Case\n",
        "The CSV file contains dummy customer data, comprising various attributes like first name, last name, company, etc. This dataset will be utilized for a RAG use case, facilitating the creation of a customer information Q&A system.\n",
        "\n",
        "## Key Components\n",
        "\n",
        "1. Loading and spliting csv files.\n",
        "2. Vector store creation using [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) and OpenAI embeddings\n",
        "3. Retriever setup for querying the processed documents\n",
        "4. Creating a question and answer over the csv data.\n",
        "\n",
        "## Method Details\n",
        "\n",
        "### Document Preprocessing\n",
        "\n",
        "1. The csv is loaded using langchain Csvloader\n",
        "2. The data is split into chunks.\n",
        "\n",
        "\n",
        "### Vector Store Creation\n",
        "\n",
        "1. OpenAI embeddings are used to create vector representations of the text chunks.\n",
        "2. A FAISS vector store is created from these embeddings for efficient similarity search.\n",
        "\n",
        "### Retriever Setup\n",
        "\n",
        "1. A retriever is configured to fetch the most relevant chunks for a given query.\n",
        "\n",
        "## Benefits of this Approach\n",
        "\n",
        "1. Scalability: Can handle large documents by processing them in chunks.\n",
        "2. Flexibility: Easy to adjust parameters like chunk size and number of retrieved results.\n",
        "3. Efficiency: Utilizes FAISS for fast similarity search in high-dimensional spaces.\n",
        "4. Integration with Advanced NLP: Uses OpenAI embeddings for state-of-the-art text representation.\n",
        "\n",
        "## Conclusion\n",
        "\n",
        "This simple RAG system provides a solid foundation for building more complex information retrieval and question-answering systems. By encoding document content into a searchable vector store, it enables efficient retrieval of relevant information in response to queries. This approach is particularly useful for applications requiring quick access to specific information within a csv file."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "import libries"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Package Installation and Imports\n",
        "\n",
        "The cell below installs all necessary packages required to run this notebook.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Install required packages\n",
        "!pip install faiss-cpu langchain langchain-community langchain-openai pandas python-dotenv"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "metadata": {},
      "outputs": [],
      "source": [
        "from langchain_community.document_loaders.csv_loader import CSVLoader\n",
        "from pathlib import Path\n",
        "from langchain_openai import ChatOpenAI,OpenAIEmbeddings\n",
        "import os\n",
        "from dotenv import load_dotenv\n",
        "\n",
        "# Load environment variables from a .env file\n",
        "load_dotenv()\n",
        "\n",
        "# Set the OpenAI API key environment variable\n",
        "os.environ[\"OPENAI_API_KEY\"] = os.getenv('OPENAI_API_KEY')\n",
        "\n",
        "llm = ChatOpenAI(model=\"gpt-3.5-turbo-0125\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# CSV File Structure and Use Case\n",
        "The CSV file contains dummy customer data, comprising various attributes like first name, last name, company, etc. This dataset will be utilized for a RAG use case, facilitating the creation of a customer information Q&A system."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Download required data files\n",
        "import os\n",
        "os.makedirs('data', exist_ok=True)\n",
        "\n",
        "# Download the PDF document used in this notebook\n",
        "!wget -O data/Understanding_Climate_Change.pdf https://raw.githubusercontent.com/NirDiamant/RAG_TECHNIQUES/main/data/Understanding_Climate_Change.pdf\n",
        "!wget -O data/customers-100.csv https://raw.githubusercontent.com/NirDiamant/RAG_TECHNIQUES/main/data/customers-100.csv\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 18,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>Index</th>\n",
              "      <th>Customer Id</th>\n",
              "      <th>First Name</th>\n",
              "      <th>Last Name</th>\n",
              "      <th>Company</th>\n",
              "      <th>City</th>\n",
              "      <th>Country</th>\n",
              "      <th>Phone 1</th>\n",
              "      <th>Phone 2</th>\n",
              "      <th>Email</th>\n",
              "      <th>Subscription Date</th>\n",
              "      <th>Website</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>1</td>\n",
              "      <td>DD37Cf93aecA6Dc</td>\n",
              "      <td>Sheryl</td>\n",
              "      <td>Baxter</td>\n",
              "      <td>Rasmussen Group</td>\n",
              "      <td>East Leonard</td>\n",
              "      <td>Chile</td>\n",
              "      <td>229.077.5154</td>\n",
              "      <td>397.884.0519x718</td>\n",
              "      <td>zunigavanessa@smith.info</td>\n",
              "      <td>2020-08-24</td>\n",
              "      <td>http://www.stephenson.com/</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>2</td>\n",
              "      <td>1Ef7b82A4CAAD10</td>\n",
              "      <td>Preston</td>\n",
              "      <td>Lozano</td>\n",
              "      <td>Vega-Gentry</td>\n",
              "      <td>East Jimmychester</td>\n",
              "      <td>Djibouti</td>\n",
              "      <td>5153435776</td>\n",
              "      <td>686-620-1820x944</td>\n",
              "      <td>vmata@colon.com</td>\n",
              "      <td>2021-04-23</td>\n",
              "      <td>http://www.hobbs.com/</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>3</td>\n",
              "      <td>6F94879bDAfE5a6</td>\n",
              "      <td>Roy</td>\n",
              "      <td>Berry</td>\n",
              "      <td>Murillo-Perry</td>\n",
              "      <td>Isabelborough</td>\n",
              "      <td>Antigua and Barbuda</td>\n",
              "      <td>+1-539-402-0259</td>\n",
              "      <td>(496)978-3969x58947</td>\n",
              "      <td>beckycarr@hogan.com</td>\n",
              "      <td>2020-03-25</td>\n",
              "      <td>http://www.lawrence.com/</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>4</td>\n",
              "      <td>5Cef8BFA16c5e3c</td>\n",
              "      <td>Linda</td>\n",
              "      <td>Olsen</td>\n",
              "      <td>Dominguez, Mcmillan and Donovan</td>\n",
              "      <td>Bensonview</td>\n",
              "      <td>Dominican Republic</td>\n",
              "      <td>001-808-617-6467x12895</td>\n",
              "      <td>+1-813-324-8756</td>\n",
              "      <td>stanleyblackwell@benson.org</td>\n",
              "      <td>2020-06-02</td>\n",
              "      <td>http://www.good-lyons.com/</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>5</td>\n",
              "      <td>053d585Ab6b3159</td>\n",
              "      <td>Joanna</td>\n",
              "      <td>Bender</td>\n",
              "      <td>Martin, Lang and Andrade</td>\n",
              "      <td>West Priscilla</td>\n",
              "      <td>Slovakia (Slovak Republic)</td>\n",
              "      <td>001-234-203-0635x76146</td>\n",
              "      <td>001-199-446-3860x3486</td>\n",
              "      <td>colinalvarado@miles.net</td>\n",
              "      <td>2021-04-17</td>\n",
              "      <td>https://goodwin-ingram.com/</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "   Index      Customer Id First Name Last Name  \\\n",
              "0      1  DD37Cf93aecA6Dc     Sheryl    Baxter   \n",
              "1      2  1Ef7b82A4CAAD10    Preston    Lozano   \n",
              "2      3  6F94879bDAfE5a6        Roy     Berry   \n",
              "3      4  5Cef8BFA16c5e3c      Linda     Olsen   \n",
              "4      5  053d585Ab6b3159     Joanna    Bender   \n",
              "\n",
              "                           Company               City  \\\n",
              "0                  Rasmussen Group       East Leonard   \n",
              "1                      Vega-Gentry  East Jimmychester   \n",
              "2                    Murillo-Perry      Isabelborough   \n",
              "3  Dominguez, Mcmillan and Donovan         Bensonview   \n",
              "4         Martin, Lang and Andrade     West Priscilla   \n",
              "\n",
              "                      Country                 Phone 1                Phone 2  \\\n",
              "0                       Chile            229.077.5154       397.884.0519x718   \n",
              "1                    Djibouti              5153435776       686-620-1820x944   \n",
              "2         Antigua and Barbuda         +1-539-402-0259    (496)978-3969x58947   \n",
              "3          Dominican Republic  001-808-617-6467x12895        +1-813-324-8756   \n",
              "4  Slovakia (Slovak Republic)  001-234-203-0635x76146  001-199-446-3860x3486   \n",
              "\n",
              "                         Email Subscription Date                      Website  \n",
              "0     zunigavanessa@smith.info        2020-08-24   http://www.stephenson.com/  \n",
              "1              vmata@colon.com        2021-04-23        http://www.hobbs.com/  \n",
              "2          beckycarr@hogan.com        2020-03-25     http://www.lawrence.com/  \n",
              "3  stanleyblackwell@benson.org        2020-06-02   http://www.good-lyons.com/  \n",
              "4      colinalvarado@miles.net        2021-04-17  https://goodwin-ingram.com/  "
            ]
          },
          "execution_count": 18,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "import pandas as pd\n",
        "\n",
        "file_path = ('data/customers-100.csv') # insert the path of the csv file\n",
        "data = pd.read_csv(file_path)\n",
        "\n",
        "#preview the csv file\n",
        "data.head()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "load and process csv data"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 19,
      "metadata": {},
      "outputs": [],
      "source": [
        "loader = CSVLoader(file_path=file_path)\n",
        "docs = loader.load_and_split()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Initiate faiss vector store and openai embedding"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 9,
      "metadata": {},
      "outputs": [],
      "source": [
        "import faiss\n",
        "from langchain_community.docstore.in_memory import InMemoryDocstore\n",
        "from langchain_community.vectorstores import FAISS\n",
        "\n",
        "embeddings = OpenAIEmbeddings()\n",
        "index = faiss.IndexFlatL2(len(OpenAIEmbeddings().embed_query(\" \")))\n",
        "vector_store = FAISS(\n",
        "    embedding_function=OpenAIEmbeddings(),\n",
        "    index=index,\n",
        "    docstore=InMemoryDocstore(),\n",
        "    index_to_docstore_id={}\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Add the splitted csv data to the vector store"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "vector_store.add_documents(documents=docs)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Create the retrieval chain"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 11,
      "metadata": {},
      "outputs": [],
      "source": [
        "from langchain_core.prompts import ChatPromptTemplate\n",
        "from langchain.chains import create_retrieval_chain\n",
        "from langchain.chains.combine_documents import create_stuff_documents_chain\n",
        "\n",
        "retriever = vector_store.as_retriever()\n",
        "\n",
        "# Set up system prompt\n",
        "system_prompt = (\n",
        "    \"You are an assistant for question-answering tasks. \"\n",
        "    \"Use the following pieces of retrieved context to answer \"\n",
        "    \"the question. If you don't know the answer, say that you \"\n",
        "    \"don't know. Use three sentences maximum and keep the \"\n",
        "    \"answer concise.\"\n",
        "    \"\\n\\n\"\n",
        "    \"{context}\"\n",
        ")\n",
        "\n",
        "prompt = ChatPromptTemplate.from_messages([\n",
        "    (\"system\", system_prompt),\n",
        "    (\"human\", \"{input}\"),\n",
        "    \n",
        "])\n",
        "\n",
        "# Create the question-answer chain\n",
        "question_answer_chain = create_stuff_documents_chain(llm, prompt)\n",
        "rag_chain = create_retrieval_chain(retriever, question_answer_chain)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Query the rag bot with a question based on the CSV data"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 15,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "'Sheryl Baxter works for Rasmussen Group.'"
            ]
          },
          "execution_count": 15,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "answer= rag_chain.invoke({\"input\": \"which company does sheryl Baxter work for?\"})\n",
        "answer['answer']"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "![](https://europe-west1-rag-techniques-views-tracker.cloudfunctions.net/rag-techniques-tracker?notebook=all-rag-techniques--simple-csv-rag)"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "name": "",
      "provenance": [],
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.11.4"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 2
}